The plot below shows the evolution of the water level over more than 21 years, from January 1st, 2000 to August 1st, 2021. The highest peaks were mainly observed in the summer months, with the highest recorded on August 9th, 2007 (329.323), August 25th, 2005 (328.827), and July 14th, 2021 (328.622).
The output below shows the summary statistics of the water level values (the five-number summary plus the mean). The values are not symmetric around the mean: they are much more tightly grouped on the left side of the mean (and the median) than on the right side. The minimum (325.4) lies only 0.4 below the median (325.8), whereas the maximum (329.3) lies 3.5 above it, which indicates that the distribution of the water level values is right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 325.4 325.6 325.8 325.9 326.1 329.3
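As a rough illustration of how such a summary and a quick skewness check could be computed, here is a minimal Python sketch using only the standard library. The `levels` list holds made-up illustrative values, not the actual water level record, and the function name `summarize` is hypothetical.

```python
import statistics

def summarize(levels):
    """Return min, quartiles, median, mean and max of a list of water levels."""
    # method="inclusive" treats the data as the full population of interest
    q1, med, q3 = statistics.quantiles(levels, n=4, method="inclusive")
    return {
        "min": min(levels),
        "q1": q1,
        "median": med,
        "mean": statistics.fmean(levels),
        "q3": q3,
        "max": max(levels),
    }

# Illustrative values only, NOT the actual water level record.
levels = [325.4, 325.5, 325.6, 325.7, 325.8, 325.9, 326.1, 326.4, 329.3]
s = summarize(levels)

# Two simple hints of right skew: mean above median, and a longer upper range.
right_skew_hint = (
    s["mean"] > s["median"]
    and (s["max"] - s["median"]) > (s["median"] - s["min"])
)
print(s)
print("right-skew hint:", right_skew_hint)
```

These two comparisons are only heuristics. A histogram or a formal skewness coefficient gives a fuller picture of the shape of the distribution.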
The histogram (i.e. frequency distribution) below confirms these observations. The data are right-skewed, and therefore not normally distributed, with a long right tail of high water levels that need to be investigated as potential outliers.
On top of the histogram we add a red curve, a smoothed version of the histogram that again shows the shape of the distribution, a purple line marking the median water level (the value that splits the observations in half), and a green line marking the mean water level.
Looking at the positions on the x-axis of the mean and the median [add mode at some point], we see that the mean seems to be a better indication of the center of the distribution. [adapt interpretation when mode is on].
[peaks-over-threshold method] The Peaks-over-Threshold method identifies extreme values as those that exceed a designated threshold u. To choose a suitable threshold, we will use the mean residual life (MRL) plot, which shows the mean excess over u as a function of u, and then look at the distribution of the data points. The lowest value of u above which the plot is approximately linear can generally be selected as the threshold. To model the high water levels, we will therefore first draw an MRL plot to choose the threshold and then apply the Peaks-over-Threshold method.
[clustering of the extremes] Clusters of extremes are groups of data points that lie above the chosen threshold u: consecutive threshold exceedances are considered to belong to the same cluster. For the daily water level data, plotting the exceedances obtained with the Peaks-over-Threshold approach lets us see the different clusters of extreme values. We can then fit the Generalized Pareto Distribution (GPD) to the exceedances or, if the exceedances exhibit autocorrelation, decluster them and fit the GPD to the cluster maxima.
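The quantity plotted in an MRL plot is the mean excess over each candidate threshold. The following minimal Python sketch shows that computation. The function name `mean_residual_life`, the candidate thresholds, and the `levels` values are assumptions made for illustration and are not taken from the actual data.

```python
def mean_residual_life(levels, thresholds):
    """For each threshold u, return (u, mean excess over u, number of exceedances)."""
    rows = []
    for u in thresholds:
        excesses = [x - u for x in levels if x > u]
        if excesses:  # skip thresholds that no observation exceeds
            rows.append((u, sum(excesses) / len(excesses), len(excesses)))
    return rows

# Illustrative values only, NOT the actual water level record.
levels = [325.4, 325.6, 325.8, 326.0, 326.2, 326.5, 327.0, 328.0, 329.0]
thresholds = [325.5, 326.0, 326.5, 327.0]

rows = mean_residual_life(levels, thresholds)
for u, mean_excess, n in rows:
    print(f"u={u:.1f}  mean excess={mean_excess:.3f}  exceedances={n}")
```

Plotting the mean excess against u gives the MRL plot. The number of exceedances is returned as well because the mean excess becomes unstable once very few points remain above u.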
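One common way to implement this grouping is runs declustering, sketched below in Python under stated assumptions. The function name `cluster_maxima`, the `run_length` parameter, and the `daily` values are hypothetical and serve only to illustrate the idea. With `run_length=1`, only strictly consecutive exceedances share a cluster, which matches the definition given above.

```python
def cluster_maxima(levels, u, run_length=1):
    """Group exceedances of u into clusters and return each cluster's maximum.

    A cluster ends once `run_length` consecutive values fall at or below u.
    """
    maxima = []
    current_max = None  # maximum of the cluster being built, if any
    gap = 0             # consecutive non-exceedances since the last exceedance
    for x in levels:
        if x > u:
            current_max = x if current_max is None else max(current_max, x)
            gap = 0
        elif current_max is not None:
            gap += 1
            if gap >= run_length:
                maxima.append(current_max)
                current_max = None
                gap = 0
    if current_max is not None:  # close a cluster still open at the end
        maxima.append(current_max)
    return maxima

# Illustrative daily values only, NOT the actual water level record.
daily = [325.6, 326.4, 326.9, 326.2, 325.7, 325.8, 327.3, 325.9, 326.1, 326.6]
maxima = cluster_maxima(daily, u=326.0)
print(maxima)
```

The excesses of these cluster maxima over u are the values to which the GPD would then be fitted. A larger `run_length` merges exceedances separated by short dips below the threshold, which may be appropriate when a single high-water event spans several days.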
[drawbacks and advantages of using block maxima method instead ]
The threshold could be set at around 326 or 327.